Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sample weight adjustment #1441

Merged
merged 4 commits into from
Jun 27, 2017
Merged

Improve sample weight adjustment #1441

merged 4 commits into from
Jun 27, 2017

Conversation

martinholmer
Copy link
Collaborator

@martinholmer martinholmer commented Jun 27, 2017

This pull request attempts to improve the logic of scaling up sub-sample weights so that they produces results that are closer to those produced using a full sample. The proposed changes attempt to resolve the issues discussed in #1439 and #1440.

@MattHJensen @feenberg @hdoupe

@codecov-io
Copy link

codecov-io commented Jun 27, 2017

Codecov Report

Merging #1441 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1441      +/-   ##
==========================================
+ Coverage   99.71%   99.71%   +<.01%     
==========================================
  Files          40       40              
  Lines        2782     2785       +3     
==========================================
+ Hits         2774     2777       +3     
  Misses          8        8
Impacted Files Coverage Δ
taxcalc/records.py 98.79% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f46a04...52242f5. Read the comment docs.

@martinholmer
Copy link
Collaborator Author

@hdoupe, Could you review the changes in pull request #1441 to make sure they are sensible given your recent work on this topic? Thanks.

@hdoupe
Copy link
Collaborator

hdoupe commented Jun 27, 2017

@martinholmer this looks sensible. What do you think about doing it this way?

WT_full_sum = self.WT.sum()
self.WT = self.WT.iloc[self.index]
WT_sub_sum = self.WT.sum()
WT_ratio = WT_full_sum / WT_sub_sum

self.WT = self.WT * WT_ratio

These should be equivalent, but this may be a bit faster. However, I'm not sure if it would matter too much given that we are only dealing with 10 years of data.

@martinholmer
Copy link
Collaborator Author

@hdoupe said:

this [#1441] looks sensible. What do you think about doing it this way?

WT_full_sum = self.WT.sum()
self.WT = self.WT.iloc[self.index]
WT_sub_sum = self.WT.sum()
WT_ratio = WT_full_sum / WT_sub_sum
self.WT = self.WT * WT_ratio

These should be equivalent, but this way may be a bit faster. However, I'm not sure if it would matter too much given that we are only dealing with 10 years of data.

PR#1441 and what you suggest above would be equivalent if the weights were the same in each year. But the weights are different in each year, so I don't think they are equivalent. In addition to resolving your issue #1439, pull request #1441 is trying to resolve my issue #1440.

Does this make sense?

@hdoupe
Copy link
Collaborator

hdoupe commented Jun 27, 2017

@martinholmer , I think they should be the same. WT_full_sum and WT_sub_sum are pandas Series objects of length 18 with each entry being the total sum of weights for 2009, 2010, ..., 2026. Then, WT_ratio is a pandas Series object where the i-th item is the i-th item of WT_full_sum divided by the i-th item of WT_sub_sum.

Finally, self.WT * WT_ratio multiplies the i-th column of self.WT by the i-th item of WT_ratio.

If my understanding is correct, this is just another way to solve the problem but without using loops. Whichever implementation you choose is probably just a matter of preference in this situation.

@martinholmer
Copy link
Collaborator Author

@hdoupe,
Thanks for the tutorial on the magic of Pandas, which is something I've had trouble getting.

I tried to use your code fragment in place of what is in #1441, but did not get the same result. In fact, the test failed even though it passes with the current code.

The most likely answer is that I made a mistake in applying your code. Maybe you can take a look.
Here is what I tried:

        # weights must be same size as tax record data
        if not self.WT.empty and self.dim != len(self.WT):
            WT_full_sum = self.WT.sum()
            self.WT = self.WT.iloc[self.index]
            WT_sub_sum = self.WT.sum()
            WT_ratio = WT_full_sum / WT_sub_sum
            self.WT = self.WT * WT_ratio

But instead of passing the test_agg test in the test_pufcsv.py file, I get this failure message:

E           ValueError: PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE
E           WHEN subfrac=0.030, rtol=0.0100, seed=180
E            year,sub,full,reldiff= 2013	2199.38	2176.94	0.0103
E            year,sub,full,reldiff= 2017	2821.50	2790.64	0.0111
E            year,sub,full,reldiff= 2018	2931.17	2899.85	0.0108
E            year,sub,full,reldiff= 2020	3136.53	3105.29	0.0101
E            year,sub,full,reldiff= 2021	3266.19	3230.47	0.0111
E            year,sub,full,reldiff= 2022	3416.16	3369.21	0.0139
tests/test_pufcsv.py:111: ValueError

Any suggestions would be welcome as I am far from being a Pandas expert.

@hdoupe
Copy link
Collaborator

hdoupe commented Jun 27, 2017

@martinholmer No problem, glad I can help. I ran the test with the first implementation and the second implementation and got the same results for each implementation:

E           ValueError: PUFCSV AGG RESULTS DIFFER IN SUB-SAMPLE AND FULL-SAMPLE
E           WHEN subfrac=0.030, rtol=0.0100, seed=180
E            year,sub,full,reldiff= 2013	2199.38	2176.94	0.0103
E            year,sub,full,reldiff= 2017	2821.50	2790.64	0.0111
E            year,sub,full,reldiff= 2018	2931.17	2899.85	0.0108
E            year,sub,full,reldiff= 2020	3136.53	3105.29	0.0101
E            year,sub,full,reldiff= 2021	3266.19	3230.47	0.0111
E            year,sub,full,reldiff= 2022	3416.16	3369.21	0.0139

tests/test_pufcsv.py:111: ValueError

The adjusted weights should sum to the same amount as the full sample weights. I tested this here:

from taxcalc.records import Records
import pandas as pd

puf = pd.read_csv("taxcalc/puf.csv")
puf_samp = puf.sample(frac=0.03, random_state=180)

rec = Records(data = puf_samp)
wt = pd.read_csv("taxcalc/puf_weights.csv")

print(wt.sum() - rec.WT.sum())

Both implementations were off by the same amount:

WT2009    0.000109
WT2010   -0.000067
WT2011   -0.000021
WT2012   -0.000074
WT2013    0.000086
WT2014    0.000078
WT2015    0.000147
WT2016   -0.000093
WT2017   -0.000097
WT2018    0.000175
WT2019   -0.000130
WT2020   -0.000175
WT2021    0.000084
WT2022   -0.000065
WT2023    0.000156
WT2024    0.000072
WT2025    0.000092
WT2026   -0.000191
dtype: float64

Are you sure that it passed the test on the first implementation? My tests indicate that the two implementations are producing the same result.

@martinholmer
Copy link
Collaborator Author

@hdoupe, Thanks again for all the help on Pandas usage. You are absolutely correct. I had no idea that Pandas would do sums for all the DataFrame columns when you do dframe.sum(). I discovered we were getting different results with my column-loop code (which should have been logically equivalent but slower) because of a Python 2.7 versus Python 3.x difference: 5/2=2 in Python 2.7, but 5/2=2.5 in Python 3.x. So my now-removed column-loop code was not accurate because the weight scale-up factor was an interger (since the weights in the puf_weights.csv file are integers).

Again, thanks for all the help. I'm going to merge this very soon.

@MattHJensen @feenberg @Amy-Xu @andersonfrailey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants